Introduction to Open Data Science - Course Project

Week 1: Warm up

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Wed Dec  8 20:32:27 2021"

“How are you feeling right now? What do you expect to learn? Where did you hear about the course?”

I am extremely excited about this course! I have just started my PhD studies at the Institute of Atmospheric and Earth System Research. My research topic is related to building a framework for climate competencies, based mostly on interviews and surveys. Since my background is in natural sciences I have very little to no experience on behavioral sciences and their methodology. For this reason, I have chosen this course. It was suggested to me by our colleagues in the faculty of educational sciences.


I expect to learn to use R, which I’m not too worried about since I have experience in Python, and even more importantly, the statistical practices and rules that apply in human sciences, when it’s not all about nature and numbers.


My repository

As text: https://github.com/joulasip/IODS-project.git


Week 2: Regression and model validation

date()
## [1] "Wed Dec  8 20:32:27 2021"


1 Description of the data


The data includes survey results from statistics course students in Finland regarding their learning. Data is collected 2014-2015. After combining variables that measure the same dimension and excluding observations where the exam points are zero, the data consists of 166 observations and 7 variables listed here:

  • gender
  • age
  • attitude
  • deep approach (maximized understanding with a true commitment)
  • surface approach (memorizing without understanding)
  • strategic approach (applying strategy to maximize learning)
  • points (max points)

The three learning approaches combine 8 subscales that all include 4 questions:

  • deep: Seeking meaning, relating ideas, use of evidence
  • surface: lack of purpose, unrelated memorizing, syllabus-boundness
  • strategic: organized studying, time management

Reading the data and checking the structure (working diary being IODS-project):

data <- read.csv("data/learning2014", row.names = 1)
str(data)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...


2 Graphical overview of the data


Summary of the data — basic statistics of the variables:

summary(data)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Graphical overview using libraries GGally and ggplot2. The first variable, gender, is defining the colour and alpha sets transparency in the graphs so that the overlapping data can be seen:

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
p <- ggpairs(data, mapping = aes(col=data$gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

The correlation coefficient, corr, refers to Pearson correlation by default — correlations equal to +1 or −1 correspond to data points lying exactly on a line. Stars in the end the correlation value represent the p-values (as explained here):

  • “***” p < 0.001
  • “**” p < 0.01
  • “*” p < 0.05
  • “.” p < 0.10


Interpretation:

There are almost twice as many female students than male. Majority of the students are between 20 and 30 years old. Male students have generally better attitude than female.

Points and attitude have the highest positive correlation 0.437, which is also significant (p < 0.001). Highest significant negative correlation can be seen between surface and deep learning approaches with corr = -0.324 (male up to -0.622). This is to be expected since they could be considered the opposite.

Strategic learning approach is slightly more common among female students than male.

Less significant negative correlation can be seen between surface and strategic learning approaches, corr = -0.161, and between surface learning approach and attitude, corr = -0.176.

The distribution of deep learning approach is shifted more towards higher values than with strategic or surface approach. The mean and median are approx. 3.6 — for strategic approach approx 3.1 and surface approach 2.8.

To me it seems odd, that the points do not correlate with deep learning approach at all. Could this mean that the exam/assignments do not measure the deep learning? Or that the deep learning approach does not in fact lead to learning? There is more evidence (all though no significance shown) that the surface learning approach has some negative correlation and strategic approach positive with the points.

3 Multiple regression


Since the points have a highly significant correlation with attitude and a slightly significant correlation with surface and strategic learning approaches, I have chosen those three for the multiple regression as explanatory variables to the target variable “points”.

my_model_multi <- lm(points ~ attitude + stra + surf, data = data)
summary(my_model_multi)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08


The following formula describes the multiple linear model:

\(y = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i\)

where \(\beta_0\) is the Intersept, \(\beta_1\) regression coefficient (Estimate) for attitude, \(\beta_2\) for strategic approach and \(\beta_3\) for surface approach. These measure the change in the mean response associated with a change in the corresponding explanatory variable. \(\varepsilon_i\) refers to the error terms (Std. Error) that are assumed to have a normal distribution with zero mean and the same variance \(\sigma^2\) for all values of the explanatory variables (as explained in the course material book Multivariate Analysis for the Behavioural Sciences).


The residuals (difference between an observed value of the target variable and the fitted value) have a median approx. 0.52. F-statistic gives a test of an omnibus null hypothesis that all the regression coefficients are zero, and it is calculated by dividing regression mean square (RGSM) by residual mean square (RMS). RMS gives an estimate of variance \(\sigma^2\). Since here F-statistic has a very low associated p-value (3.156e-08) it is very unlikely that all of the coefficients are zero — as can be expected.

The t-values are obtained by dividing the estimated regression coefficient by the standard error of the estimate, and the associated significance levels (Pr(>|t|) in the table) can indicate the importance of the explanatory variable.


According to the model, attitude has the strongest and the most significant (p < 0.001) positive correlation with the points. Since only the attitude has a high significance, I will try making models with attitude and surface approach, and attitude and strategic approach separately. The latter summary shown below.

my_model_2 <- lm(points ~ attitude + stra, data = data)
summary(my_model_2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

4 Multiple R-squared and relationship between variables


\(R^2\) gives the proportion of variability in the target variable accounted for by the explanatory variables — in the first case approx. 21% of the variability in points is accounted for by these three explanatory variables: attitude, strategic learning approach and surface learning approach. \(R\) is a measure of the fit of the model, the multiple correlation coefficient.


In the model with only attitude and strategic learning approach as explanatory variables, (shown above) the strategic approach has p-value of 0.09 so still below 0.1, so slightly better. In this model, according to \(R^2\), approximately the same amount of the variability, 20%, can be explained by the combination of attitude and strategic approach than in the previous model with three explanatory variables. This gives confidence that the importance of surface learning approach is very insignificant.


Let’s still try how the model would look with only one explanatory variable, attitude. The resulted summary is shown below — attitude alone counts for 19% of the variability in points.

my_model_3 <- lm(points ~ attitude, data = data)
summary(my_model_3)
## 
## Call:
## lm(formula = points ~ attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

It is important to note that the connections between the explanatory variables can also have a major effect on the model, and they need to be taken into account when deciding which variables to use and which leave out! Here the surface and strategic learning approaches do have small negative correlation according to the overview of the data (part 2) but still much less significant than the correlation between attitude and points, so I consider it not important.

5 Diagnostic plots

I will continue the analysis with the model including attitude and strategic learning approach as the explanatory variables.

par(mfrow = c(2,2))
plot(my_model_2, which = c(1,2,5))


The graph above includes the diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs. Leverage.

Assumptions of the model:

  • Linearity
  • Errors are normally distributed, not correlated and have a constant variance, \(\sigma^2\)


QQ-plot allows exploring the normality assumption — better the points fall into the line, better the assumption. The fit here is rather good especially in the middle, but there seem to be some outliers in both ends of the distribution that fall under the line.


Constant variance of errors and them not being dependent on the values of the explanatory variables can be assessed with the Residuals vs. fitted values graph. Any pattern in the plot would mean issues with this assumption. Here the plot has no clear patterns but the same outliers are marked in this one as well — the observations 145, 56 and 35 seem odd. Based on these two plots, I would look more deeply into those answers and would consider removing those observations all together if a clear reason for their behaviour is found.


Residuals vs leverage graph helps to identify observations that have unusually high impact on the model. Here again few observations stand out, and they are marked with numbers — 145 and 35 are there again. Since Cook’s distance is slightly higher around the same leverage values, it means that these values have a rather large influence on the model and should therefore be looked into more.


Overall the diagnostic plots look still quite good and the model can be considered valid. The same outliers are also present if the diagnostic plots are made for the model with only attitude as explanatory variable. Removing the outliers could lead to larger significance and clearer model results.



Week 3: Logistic regression

date()
## [1] "Wed Dec  8 20:32:33 2021"


1 Description of the data (1p)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(boot)


alc <- read.csv("data/alc", row.names = 1)
dim(alc)
## [1] 370  51
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "n"          "id.p"       "id.m"      
## [31] "failures"   "paid"       "absences"   "G1"         "G2"        
## [36] "G3"         "failures.p" "paid.p"     "absences.p" "G1.p"      
## [41] "G2.p"       "G3.p"       "failures.m" "paid.m"     "absences.m"
## [46] "G1.m"       "G2.m"       "G3.m"       "alc_use"    "high_use"  
## [51] "cid"

The data represents Portuguese secondary school students’ performance including grades, demographic, social and school related features. It was collected by using school reports and questionnaires. Two subjects are combined here: Mathematics and Portuguese language. More information can be found here including explanations of all the variables listed above: https://archive.ics.uci.edu/ml/datasets/Student+Performance

‘Alc_use’ combines weekday (Dalc) and weekend (Walc) alcohol use. ‘High_use’ is defined as TRUE if ‘alc_use’ > 2, so students consumption of alcohol is defined high if they are using alcohol twice a week or more.


2 Hypothesis (1p)

Variables that I assume having important correlation to alcohol consumption, and my hypothesis about the connection:

  • G3: final grade (numeric: from 0 to 20)
    • “final grade is negatively correlated with alcohol consumption”
  • goout - going out with friends (numeric: from 1 - very low to 5 - very high)
    • “going out with friends correlated strongly positively with alcohol consumption”
  • absences - number of school absences (numeric: from 0 to 93)
    • “absences can be a consequence of high alcohol consumption”
  • age - student’s age (numeric: from 15 to 22)
    • “younger students use more alcohol”

In addition, I will include gender as a divider in the graphs.


3 Numerical and Graphical exploration (5p)

Glimpse to the selected data again:

#select the wanted variables to check
sel_alc <- select(alc,one_of(c("G3","goout","absences","age")))
glimpse(sel_alc)
## Rows: 370
## Columns: 4
## $ G3       <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14, 1…
## $ goout    <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2, 5…
## $ absences <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, 1, …
## $ age      <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1…


Bar plots of the chosen variables:

gather(sel_alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

Small amount of absences are common and only very few students have more than 10 absences. Most studetns are between 15 and 18 years ols and only few are older. Grades seem to follow normal distribution that is somewhat shifted towards better scores — 10 and 12 are the most received grades. Going out with friends is also quite normally distributed — most students responded with 3 or 2.



Making box plots of the variables as high use in the x-axis and gender defining the colour:

library(cowplot)

# initialize the plots for the 4 variables chosen
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + ylab("grade") + xlab("high use") + ggtitle("Student grade by alcohol consumption and sex")
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + xlab("high use") + ggtitle("Student absences by alcohol consumption and sex")
g3 <- ggplot(alc, aes(x = high_use, y = goout, col = sex)) + geom_boxplot() + ylab("going out") + xlab("high use") + ggtitle("Going out by alcohol consumption and sex")
g4 <- ggplot(alc, aes(x = high_use, y = age, col = sex)) + geom_boxplot() + xlab("high use") + ggtitle("Student age by alcohol consumption and sex")

plot_grid(g1,g2,g3,g4)


Producing summary statistics for each selected variable based on high use of alcohol and gender:

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_grade
##   <chr> <lgl>    <int>      <dbl>
## 1 F     FALSE      154       11.4
## 2 F     TRUE        41       11.8
## 3 M     FALSE      105       12.3
## 4 M     TRUE        70       10.3

Grades: I hypothesized that the grades would be negatively affected by alcohol use. According to the box plot and the mean grade shown above, this is true only for male students — males not using too much alcohol have higher average grade than females (using alcohol often or not) and the males using alcohol often have notably lower average grade than females. The box plot shows that some male students have received very low grades in case of high alcohol consumption, which brings the average down quite a lot. For female students the variance is smaller, and no outliers are present (see box plot).


alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_absences
##   <chr> <lgl>    <int>         <dbl>
## 1 F     FALSE      154          4.25
## 2 F     TRUE        41          6.85
## 3 M     FALSE      105          2.91
## 4 M     TRUE        70          6.1

Absences: I assumed that the higher alcohol use would lead to absences. For both genders this seems to hold true. Male students not using alcohol have very low average of absences, only 2.9, so the difference to the high alcohol users is very clear, approx 3 more days of absences. With female students the difference is approx. 2.6 days. Female students have more absences in general and there are several students with much higher number of absences than the average — all the way to more than 40.


alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_goout
##   <chr> <lgl>    <int>      <dbl>
## 1 F     FALSE      154       2.95
## 2 F     TRUE        41       3.39
## 3 M     FALSE      105       2.70
## 4 M     TRUE        70       3.93

Going out: The box plot for going out looks a bit odd due to the likert scale. Median among the less alcohol using students is 3 and those using more than twise a week 4 (male and female). Here also the hypothesis holds. Difference between the mean values is larger for male students according to the summary statistics above. Students going out more are also consuming more alcohol, which makes a lot of sense.


alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = mean(age))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_age
##   <chr> <lgl>    <int>    <dbl>
## 1 F     FALSE      154     16.6
## 2 F     TRUE        41     16.5
## 3 M     FALSE      105     16.3
## 4 M     TRUE        70     16.9

Age: Mean age of those using alcohol many times a week is slightly smaller for female students and higher for male students than the age of non-alcohol users. This shows clearly in the median in the box plot. Female students maybe start drinking earlier than male students. It seems to hold true to some extent that younger students drink more — however, the students are generally quite young as shown in the bar plots earlier.



4 Logistic regression (5p)

Statistically exploring the data with logistic regression:

m <- glm(high_use ~ age + absences + G3 + goout, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ age + absences + G3 + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8967  -0.7525  -0.5409   0.9467   2.2946  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.77834    1.96186  -1.926  0.05412 .  
## age          0.04563    0.11022   0.414  0.67890    
## absences     0.07315    0.02237   3.270  0.00108 ** 
## G3          -0.04472    0.03923  -1.140  0.25435    
## goout        0.70668    0.11917   5.930 3.03e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 388.67  on 365  degrees of freedom
## AIC: 398.67
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)         age    absences          G3       goout 
## -3.77834234  0.04562870  0.07314823 -0.04471930  0.70667982
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR        2.5 %   97.5 %
## (Intercept) 0.02286056 0.0004637374 1.033922
## age         1.04668570 0.8430071337 1.299954
## absences    1.07589001 1.0312932049 1.127070
## G3          0.95626587 0.8852281993 1.032878
## goout       2.02724925 1.6139414455 2.577757

Odds ratios (OR) tell about the odds of high use of alcohol based on the explanatory variable. If the value is over 1, there is a positive effect of the explanatory variable to the target variable (high use). In this case, age, absences and going out increase the changes of high alcohol use, and alcohol use and grades have negative association. However, as the significance test in the model summary shows, only absences and going out have a statistical significance in the model.

In case the value 1 is between the confidence intervals (CI), there is no evidence of an association between the variables. In this case, only the confidence intervals of absences and going out don’t include the value 1, which means that those two variables and high alcohol use have association.

When it comes to the hypothesis before, this goes together with the previous analysis based on box plots. Absences have a clear positive correlation with the high alcohol use, as does going out with friends. Based on the previous, this model would look different regarding the grades if it was made for male students separately, whose grades were notably lower if they were using alcohol more than twice a week.


5 Predictive power of the model (3p)

For this exploration I will include the variables ‘absences’ and ‘goout’ according to the analysis above. Here I will use the model to predict the the actual values, and save and compare the predictions to the actual values. In case probability is larger than 0.5, the high use is considered TRUE.

m2 <- glm(high_use ~ absences + goout, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m2, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, absences, goout, high_use, probability, prediction) %>% tail(10)
##     absences goout high_use probability prediction
## 361        7     3     TRUE  0.28963176      FALSE
## 362        3     3     TRUE  0.23104046      FALSE
## 363        2     1     TRUE  0.06029226      FALSE
## 364        4     4     TRUE  0.40315734      FALSE
## 365        3     2    FALSE  0.12606082      FALSE
## 366        4     3     TRUE  0.24487648      FALSE
## 367        0     2    FALSE  0.10291931      FALSE
## 368        4     4     TRUE  0.40315734      FALSE
## 369        8     4     TRUE  0.47825016      FALSE
## 370        0     2    FALSE  0.10291931      FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   236   23
##    TRUE     65   46

The table of the last ten values shows that the model has predicted all of these cases to not have high alcohol consumption based on absences and going out, but the actual observations have 7/10 high alcohol usage. This model is clearly not perfect. Still the table of target variable vs. predictions shows that most cases are predicted correctly by the model.


Then I plot the predictions against the actual values (graphic visualization). If the model was perfect in predicting the actual values, the upper line would have dots only on the right side (probability > 0.5) and lower line only on the left side.

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.63783784 0.06216216 0.70000000
##    TRUE  0.17567568 0.12432432 0.30000000
##    Sum   0.81351351 0.18648649 1.00000000

According to the table, approximately 26 % of the predictions went wrong (the training error).


6 10-fold cross-validation (Bonus 2p)

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2459459

The average number of wrong predictions in the cross validation is aroun 0.24-0.26 (changes with each run). Therefore the model’s prediction capabilities are very close to the model at Datacamp.


END OF WEEK 3!


Week 4: Clustering and Classification

date()
## [1] "Wed Dec  8 20:32:36 2021"


1 Description of the data (1p)

The data represents Housing Values in Suburbs of Boston from the 70s. Loading the data and taking a look at the data:

library(dplyr)
library(ggplot2)
library(tidyr)
library(boot)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.92 loaded
# load the data
data(Boston)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
dim(Boston)
## [1] 506  14

There are 506 observations and the following 14 variables:

  • crim: pre capita crime rate by town
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft
  • indus: proportion of non-retail business acres per town.
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox: nitrogen oxides concentration (parts per 10 million).
  • rm: average number of rooms per dwelling.
  • age: proportion of owner-occupied units built prior to 1940.
  • dis: weighted mean of distances to five Boston employment centres.
  • rad: index of accessibility to radial highways.
  • tax: full-value property-tax rate per $10,000.
  • ptratio: pupil-teacher ratio by town.
  • black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat: lower status of the population (percent).
  • medv: median value of owner-occupied homes in $1000s.


2 Graphical overview of the data (2p)

Looking at the summary of the data:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Making a correlation matrix and visualizing it:

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)

cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

Based on the visual overview the strongest negative correlation is between:

  • NO concentration and distance to employment centres
  • proportion of buildings built before 1940 and weighed mean distance to five employment centres
  • lower status of the population and median value of owner-occupied homes

According to this it sounds like the employment centres produce NO emissions, which makes sense if they are based of heavy industry. Also the second is clear: areas with older buildings are built close to the work places and newer buildings are further away since the place is occupied already. More lower status population unsurprisingly correlates with lower value of owner-occupied homes.

And the strongest positive correlation is between:

  • accessibility to radial highways and full-value property-tax rate per $10,000
  • NO concentration and proportion of non-retail business acres per town

These statements are also reasonable: the access to highways indicate higher property-tax rate and industry (non-retail) businesses emit NO emissions.


3 Standardizing the data and categorizing crime rate (2p)

#centre and standardize variables
boston_scaled <- scale(Boston)

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled)

The mean of the variables is now zero since the standardizing includes dividing all the variables with their means.


Next I will create a factor variable of crime rate and remove the old one, as well as split the data into test and train sets for testing of predictions.

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))

# removing the old crime rate variable
boston_scaled <- dplyr::select(boston_scaled, -crim)

# adding the new categorized crime rate variable to the standardized dataset
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows 
n <- nrow(boston_scaled)

# choosing randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# creating the train set
train <- boston_scaled[ind,]

# ... and the test set 
test <- boston_scaled[-ind,]


4 Linear discriminant analysis (3p)

Then I will fit the linear discriminant analysis to the train data with crime rate as the target variable and all the other variables as predictor variables.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2673267 0.2500000 0.2301980 0.2524752 
## 
## Group means:
##                   zn      indus         chas        nox          rm        age
## low       1.01365809 -0.9020188 -0.126510540 -0.8773833  0.42708690 -0.9115852
## med_low  -0.08861018 -0.3519789  0.039520456 -0.5763419 -0.08170405 -0.3685727
## med_high -0.43099276  0.3180964  0.193349455  0.4174183 -0.07261634  0.4141706
## high     -0.48724019  1.0171096 -0.002135914  1.0481898 -0.40179899  0.7976710
##                 dis        rad        tax     ptratio       black       lstat
## low       0.9186185 -0.6968812 -0.7146801 -0.48969511  0.37513211 -0.76244089
## med_low   0.3082570 -0.5338554 -0.4509952 -0.04165789  0.32386115 -0.18029794
## med_high -0.3830739 -0.3866442 -0.2426354 -0.13243585  0.04203384  0.05939468
## high     -0.8352580  1.6382099  1.5141140  0.78087177 -0.79350846  0.88685880
##                 medv
## low       0.52065193
## med_low   0.05015199
## med_high -0.02823585
## high     -0.65125598
## 
## Coefficients of linear discriminants:
##                 LD1          LD2        LD3
## zn       0.12107054  0.666620929 -0.9420107
## indus    0.03864901 -0.478369303 -0.2264823
## chas    -0.06610835 -0.050144462  0.2070847
## nox      0.39951497 -0.531165626 -1.3726847
## rm      -0.07932483 -0.060757004 -0.1078376
## age      0.26306297 -0.267049735 -0.2171279
## dis     -0.07292728 -0.175042425 -0.1094167
## rad      2.99518933  0.930812416 -0.2561549
## tax     -0.01831158  0.072771367  0.9635695
## ptratio  0.13394016  0.030206484 -0.1712937
## black   -0.12061832  0.001206622  0.1040602
## lstat    0.23946734 -0.110242026  0.4224073
## medv     0.17355932 -0.213025370 -0.1307237
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9478 0.0396 0.0126


The following represent the LDA biplot:

# making arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plotting the results with arrows
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

LD1 is the first linear discriminant and the LD2 the second. What these exactly represent hopefully becomes clearer next week. The arrows show the direction and extent of the effect of each explanatory variable: distance to the radial highways seems to be the strongest indicator of high crime rate.


5 Prediction of classes (3p)

The correct classes from the test data are saved and then removed here:

# saving the correct classes from test data
correct_classes <- test$crime

# lastly remove the crime variable from test data
test <- dplyr::select(test, -crime)


Then I will use the LDA model for predicting the classes, and cross tabulate the results with the categories from the test set.

lda.pred <- predict(lda.fit, newdata = test)

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12       7        0    0
##   med_low    2      18        5    0
##   med_high   2      10       21    0
##   high       0       0        0   25


The category ‘high’ is very well predicted by the model. The observations belonging to the category ‘low’ are often predicted as ‘med-low’. In general the model seems pretty good, but due to randomness of defining the test and train groups, the model changes somewhat with every run and therefore the results vary as well. An average over many runs would be needed for more precise estimate of the performance of the model.


6 K-means (4p)

I will load the Boston data again and standardize it, after which I will calculate the distances between the observations.

data('Boston')
boston_scaled2 <- scale(Boston)
#making the data frame out of the data
boston_scaled2 <- as.data.frame(boston_scaled2)

#calculating euclidean distance matrix
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970


Next the K-means algoritm with 3 clusters is run and the results visualized (only columns 6-10 are plotted to see the results better):

km <-kmeans(boston_scaled2, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled2[6:10], col = km$cluster)

#finding the optimal number of clusters:
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

Around the x-value 2 in the previous plot, the the radical drop ends and therefore the optimal number of clusters is 2. This produces better groups:

km <-kmeans(boston_scaled2, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled2[6:10], col = km$cluster)

The clusters are the clearest in the variables rad (distance to the radial highways) and tax (property tax-rate). There is a well-separeted group of observations that all have same, very high value of rad. Same goes with the tax. These same observations (in the graph the red ones) appear close to each other also in the other variable pairs. They seem to also have high portion of old buildings, as well as short distance to employment centres.

The other variables not presented in the graph were checked but did not provide additional information for the interpretation of the results.


END OF WEEK 4!


Week 5: Dimensionality reduction techniques

date()
## [1] "Wed Dec  8 20:32:39 2021"


1 Graphical overview and summaries of variables (3p)


The data includes 155 observations and the following 8 variables related to human development globally:

  • GNI: Gross National Income per capita
  • LEB: Life expectancy at birth
  • EYE: Expected years of schooling
  • MMR: Maternal mortality ratio
  • ABR: Adolescent birth rate
  • PRP: Proportion of female representatives in parliament
  • edu2R: Ratio between females and males with at least secondary education (if > 1, larger portion of female are educated than men)
  • labR: Ratio between females and males in the labour force (if > 1, larger portion of female are part of labour force than men)


I will start by reading the data and looking at the graphical overview and summary of the variables:

library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
#library(boot)
library(corrplot)
library(FactoMineR)

# load the data
human <- read.csv("data/human.csv", header = TRUE, stringsAsFactors = TRUE, row.names = 1)

# summaries of the variables
summary(human)
##      edu2R             labR             LEB             EYE       
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI              MMR              ABR              PRP       
##  Min.   :  2.00   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.: 53.50   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 99.00   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 98.73   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.:143.50   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :194.00   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# graphical overview with 
p <- ggpairs(human, lower = list(combo = wrap("facethist", bins = 20)))
p

Few observations from the data overview:

  • Starting from the first variable, there is a peak at around value 1, which indicated that there is no large difference in the portion of girls being educated vs. boys. However, there is another bumb in the graph somewhere around 0.4 — this means that there are countries where notably larger share of boys are educated compared to girls. This ratio also seems to have a significant, strong correlation with life expectancy and expected years of schooling (not surprisingly), and a strong negative correlation with maternal mortality.
  • Life expectancy at birth peaks around 70 years but there are still many countries with life expectancy around 60 years or so. There is a major positive correlation between it and expected years of education, and negative correlation with maternal mortality and adolescent birth rate.
  • Adolescent birth rate and maternal mortality correlate strongly as well — when having children as an adolescent, the risks of giving birth are major. In general maternal mortality rate has very low values, which can be explained with the generally high quality of health care and maternal care in the world today.
  • Gross national income has quite a large variance, and no clear peak. It has some slightly positive correlation with life expectancy at birth and ratio of female vs male that are educated, but the significance is not strong.


These same observations can be made also from the correlation matrix presented below.

#plotting the correlation matrix
cor_matrix <- cor(human) %>% round(digits = 2)

corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)


2 PCA analysis (2p)


Doing a PCA analysis on the data:

pca_human <- prcomp(human)

# making a summary of the PCA 
s <- summary(pca_human)

# printing rounded percentages of variance captured by each principal component
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 92.3  6.0  1.4  0.3  0.0  0.0  0.0  0.0
# making the lables include the percentages
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.8, 1),
       col = c("cornsilk4", "darkgoldenrod"),
       xlab = pc_lab[1],
       ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

PCA biplot of human development data shows the first two principal components, that together cover 98% of variation in the data. Gross National Income has mostly effect on the PC2 and maternal mortality rate is the strongest contributor to the PC1. The other variables have a very small eigenvalues (the vectors).


3 Standartizing the variables & PCA (4p)

Next the variables will be standardized and the same PCA analysis done again.

human_std <- scale(human)
pca_human_std <- prcomp(human_std)

# making a summary of the PCA 
s_ <- summary(pca_human_std)

# printing rounded percentages of variance captured by each principal component
pca_pr_ <- round(100*s_$importance[2, ], digits = 1)
pca_pr_
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 48.3 16.2 12.3  9.4  6.1  3.6  2.7  1.4
# making the lables include the percentages
pc_lab_ <- paste0(names(pca_pr_), " (", pca_pr_, "%)")

# draw a biplot
biplot(pca_human_std,
       cex = c(0.8, 1),
       col = c("cornsilk4", "darkgoldenrod"),
       xlab = pc_lab_[1],
       ylab = pc_lab_[2])

The results are looking quite different. Standardizing brings all the observations to the same range, and therefore it is easier to see the differences and correlations between them, and their ifluence on the principal components. PCA biplot of scaled human development data shows the first two principal components that explain together approximately 64 % of the total variation in the data. After standardizing there are more principal components with some effects (shown in the percentages above the graph) than without standardizing. In general it doesn’t really make sense to analyse the results of non-standardised PCA.


4 Interpretation (2p)

The arrows pointing to the same direction as the principal component axis represent the features contributing to that PC. Then again the arrows having a small angle between their directions are strongly correlated with each other. The length of the arrow tells about the standard deviation of the variable — longer arrow, higher standard deviation.

Therefore features defining the first principal component are expected years of education, ratio between women and men having secondary education, life expectancy at birth and Gross National Income — that all are rather highly correlated with each other — as well as maternal mortality ratio and adolescent birth rate, that are strongly correlated with each other as well. These features pointing to opposite directions also correlate with each other, but negatively. In conclution PC1 tells about level of education and giving birth.

The second principal component is mostly defined by proportion of females in parliament and ratio of female vs male in labour force, aka it tells something about equality in worklife.

Then we can look at the principal components together and in relation to the observations (Countries). The lower half of the graph shows countries with less women in power, and the upper half more. The left hand countries have a high life expectancy and good level on education, and the right hand side countries are not as developed education- or maternity-wise. Following from this, the countries with high education levels and high ratio of women in parliament are located in the upper left part of the graph and can be thought as the most developed, based on these indicators. At the opposite end of the graph, lower right, there are countries with high maternal mortality rates, higher adolescent birth rate and less women in power. These countries can be said to be the least developed.

This analysis complements the descriptive analysis done based on the correlation matrix, and offers more insight to the data — PCA analysis helps in seeing overarching themes rising from the data.


5 MCA on tea data (4p)

Checking on the tea data and Multiple Correspondence Analysis on the data:

data(tea)

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea) # observations 300, variables 36
## [1] 300  36
# selecting the following columns to keep for the analysis 
keep_columns <- c("Tea", "How", "how", "sophisticated", "where", "spirituality")

# creating the new dataset based on the wanted variables
#FOR SOME REASON THIS PART IS CRASHING WHEN I KNIT THE index FILE, BUT WORKS IF I ONLY KNIT chapter5 FILE...

#tea_time <- select(tea, one_of(keep_columns))

#MCA on the data
#mca <- MCA(tea_time, graph = FALSE)

#making a plot about the results
#plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")

Since I had problems with running the index file even though the scripts works well when kinning only chapter5 file, I will attach a picture of the results instead:

MCA plot of the selected variables of the tea data.

From the visualization we can see that using tea bags for making the tea goes hand in hand with buying the tea from a chain store, and similarly using unpacked tea and shopping in specific tea shops are very close. These make a lot of sense since most loose-leaf teas are sold in tea shops.

Black tea is close to sophisticated and milk usage rather close to not-sophisticated. Green tea and using lemon both stand out from the other variables.

These are maybe not the most interesting variables for the Multiple Correspondence analysis, and some others could have been chosen instead.


END OF WEEK 5!


Week 5: Analysis of longitudinal data

date()
## [1] "Wed Dec  8 20:32:45 2021"


1 RATS (7p)


First we will take a look at data about rats and their nutrition. The data consists of body weight of three groups of rats measured over a 9-week period (64 days). The data was already converted from the wide to long format in the data wrangling part of the exercise. We start by reading the data in, and checking its structure. Variables ID and Group are changed to factors and the summary of the data shown. After that the data is drawn so that the groups of rats are in different graphs.

library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
#library(boot)
#library(corrplot)
#library(FactoMineR)

# load the data (saved in long form)
RATSL <- read.table("data/RATS.txt", header = TRUE)

#wide format (original data) for later use
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep = "\t", header = T)

#checking the data structure
str(RATSL)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
# making factors out of ID and Group variables
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)

# summaries of the variables
summary(RATSL)
##        ID      Group       WD                Weight           Time      
##  1      : 11   1:88   Length:176         Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   Class :character   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   Mode  :character   Median :344.5   Median :36.00  
##  4      : 11                             Mean   :384.5   Mean   :33.55  
##  5      : 11                             3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11                             Max.   :628.0   Max.   :64.00  
##  (Other):110
# drawing the data, one group in one graph
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() + 
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) + 
  scale_y_continuous(name = "Weight (grams)") + 
  theme(legend.position = "none")

Already these graphs shows that the groups have rather different values of weight. First group of rats is the lightest and the last group the heaviest overall. The group 2 has one rat that is throughout the time series, notably heavier than the others — even the group 3 rats. Not surprisingly, the weight of all the rats is increasing during the test period of 9 weeks.



Next the data will be standardized and same graph is produced again.

# standardizing the data
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdweight = (Weight - mean(Weight))/ sd(Weight)) %>%
  ungroup()

# drawing the standardized data
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
  geom_line() + 
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) + 
  scale_y_continuous(name = "Standardized weight") + 
  theme(legend.position = "none")

Now the growing trend is not so clear anymore.



Then a summary graph of the data will be produced.

# number of days, baseline (day 1) included
n <- RATSL$Time %>% unique() %>% length()

# summary data with mean and standard error of Weight by Group and time 
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# taking a glimpse at the summary data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.30…
# plotting the mean profiles with errorbars
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.9,0.4)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

All of the lines are clearly separated and do not overlap, which suggests a difference between the groups.



Then we’re taking a summary measure approach looking into differences between groups, and producing boxplots of the mean values of each group.

# creating a summary data by Group and ID with mean as the summary variable (excluding the baseline)
RATSL8S <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# glimpse the data
glimpse(RATSL8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 4…
# draw a boxplot of the mean versus Group
ggplot(RATSL8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), 64 days")
## Warning: `fun.y` is deprecated. Use `fun` instead.

From the graph above we can see that one of the mean values in Group 2 is notably higher than others — this was visible already from the graphs including all the data. Let’s leave that outlier out of the analysis to avoid bias it might cause to the conclusions.

# create a new data by filtering the outlier and adjust the ggplot code the draw the plot again with the new data
RATSL8S1 <- RATSL8S %>% filter(mean < 570)

ggplot(RATSL8S1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), 64 days")
## Warning: `fun.y` is deprecated. Use `fun` instead.

After removing the outlier, there is no overlap in any of the rats’ weight lines with other groups than their own. To check the validity of the conclusion that there is significant difference between the groups, we will create a linear regression model with the baseline (starting point weight) and the group as explanatory variables to the mean weight variable, and compute the variance analysis.

# add the baseline from the original data as a new variable to the summary data
RATSL8S2 <- RATSL8S %>%
  mutate(baseline = RATS$WD1)

# fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = RATSL8S2)

# compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The analysis shows that the baseline correlated strongly with the mean weight — which makes sense — and that the group has a slightly significant correlation with it — 0.076 is at the border of not being significant, but is still something.



2 BPRS (8p)


Now we have a new data set at hand. It represents 40 male subjects that we receiving different treatment for eight weeks during which they were rated on the brief psychiatric rating scale (BPRS) — once before the treatment and then weekly. The scale is used to evaluate whether or not the pacient has schizophrenia.

First we will take a look at the data, change variables treatment and subject into factors, and make a graph showing the time series of the two treatment groups.

# load the data (saved in long form)
BPRSL <- read.table("data/BPRS.txt", header = TRUE)

#checking the data structure
str(BPRSL)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
# making factors out of treatment and subject variables
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)

# summaries of the variables
summary(BPRSL)
##  treatment    subject       weeks                bprs            week  
##  1:180     1      : 18   Length:360         Min.   :18.00   Min.   :0  
##  2:180     2      : 18   Class :character   1st Qu.:27.00   1st Qu.:2  
##            3      : 18   Mode  :character   Median :35.00   Median :4  
##            4      : 18                      Mean   :37.66   Mean   :4  
##            5      : 18                      3rd Qu.:43.00   3rd Qu.:6  
##            6      : 18                      Max.   :95.00   Max.   :8  
##            (Other):252
# graph showing all the data
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

There seem to be a downward trend in the BPRS during the study with almost all the patients. Also the lower values in the beginning seem to lead to smaller values in the end as well. However, the differences are large between the subjects.



It’s time to look at some different models of the data. First we will fit a linear mixed model to the data without taking the into account the repeated nature of the data.

# create a regression model RATS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)

# print out a summary of the model
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

This model is not to be trusted since we know that there is very likely correlation between different measurements of the same subject. This correlation can be taken into account by a random factor.



To account for this we can use a random intercept model. The random component is assumed to be normally distributed and be constant in time. This allows the linear fit for each subject to differ in intercept from other subjects.

# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)

# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

The estimated stardard error of week is smaller than with the linear mixed model, which comes down to the time-correlation being taken into account. This is still not good enough for us since random intercept model doesn’t often represent the observed pattern of variance and correlations between measurements well in longitudinal data.



To allow heterogeneity in both intercepts and slopes, we can use a random intercept and random slope model. The two random effects are assumed to have a bivariate normal distribution.

# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value associated with the chi-squared statistic given by the likelihood ratio test (anova) is quite small (*), which tells about the latter model providing a better fit for the data.



Final model to make is a random intercept and random slope model that allows interaction between treatment x week.

# create a random intercept and random slope model with the interaction
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | treatment), data = BPRSL, REML = FALSE)
## boundary (singular) fit: see ?isSingular
# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | treatment)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2843.3   2870.5  -1414.7   2829.3      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8252 -0.7277 -0.2586  0.5697  4.0818 
## 
## Random effects:
##  Groups    Name        Variance  Std.Dev. Corr 
##  treatment (Intercept) 2.824e-02  0.16804      
##            week        1.765e-03  0.04201 -1.00
##  Residual              1.516e+02 12.31302      
## Number of obs: 360, groups:  treatment, 2
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.3664  33.996
## week         -2.2704     0.2531  -8.971
## treatment2    0.5722     1.2979   0.441
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.741       
## treatment2 -0.475  0.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular
# perform an ANOVA test on the two models - with interaction and without
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref2: bprs ~ week + treatment + (week | treatment)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
## BPRS_ref2    7 2843.3 2870.5 -1414.7   2829.3                     
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 97.896  0

Here anova gives us 0 degrees of freedom, and therefore no p-value. My assumption is that the random intersept and random slope model with interaction is therefore not useful for this data.


Let’s therefore use the previous model, the random intersept and random slope model, for finding fitted values and plotting them after the observed values for visual comparison.

# draw the plot of RATSL with the observed Weight values
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_x_continuous(name = "Weeks") +
  scale_y_continuous(name = "BPRS") +
  theme(legend.position = "none") +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both)

# create a vector of the fitted values from the wanted model
Fitted <- fitted(BPRS_ref1)

# create a new column fitted to RATSL
BPRSL <- BPRSL %>% mutate(fitted = Fitted)

# draw the plot of RATSL with the Fitted values of weight
ggplot(BPRSL, aes(x = week, y = Fitted, linetype = subject)) +
  geom_line() +
  scale_x_continuous(name = "Weeks") +
  scale_y_continuous(name = "Fitted BPRS") +
  theme(legend.position = "none") +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both)

I tried also plotting the fitted values based on the first model with only intersept related random effect but that leads to all the fitted lines being parallel (since no slope randomness is allowed). Then again the model including also interaction doesn’t produce anything as shown earlier. In the end, the best model is definitely this one plotted here even though it is not perfect either. In general, I would conclude that the linear models are not maybe the best here, but some more complicated correlation could be found. What can be said for sure is that there is no clear evidence of either treatment working better than the other (or differently in any way for that matter).


Thank you for the course!



THE END!